Introduction to Triton Programming: From Eager-Mode Operators to Block-Based Parallelism

Moving from PyTorch eager mode to Triton requires a change in how you view a tensor: no longer as a single object, but as a collection of separate blocks, or tiles, that you manipulate individually.

1. PyTorch Tensors vs. Triton Tensors

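It is important to distinguish between a Triton tensor and a PyTorch tensor. A PyTorch tensor is a host-side Python object that carries shape, data type, device, stride, and storage metadata. A Triton kernel, by contrast, works with raw data pointers into a particular region of memory and with the blocks of values loaded from them, which makes very low-level optimization possible.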
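To make the distinction concrete, the following plain-Python sketch models what is left once the host-side wrapper is stripped away: a base pointer plus strides, from which flat element offsets are computed. `TensorMeta`, `contiguous_strides`, and `flat_offset` are hypothetical names used only for illustration; they are not part of PyTorch or Triton. Strides are assumed to be counted in elements, as PyTorch reports them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for the host-side metadata a PyTorch tensor carries."""
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]  # counted in elements, not bytes


def contiguous_strides(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Row-major strides for a densely packed tensor of the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(meta: TensorMeta, index: Tuple[int, ...]) -> int:
    """Element offset from the base pointer for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, meta.strides))


dense = TensorMeta(shape=(3, 4), strides=contiguous_strides((3, 4)))
# A transposed view shares the same storage; only shape and strides are swapped.
transposed = TensorMeta(shape=(4, 3), strides=(1, 4))

print(dense.strides)                    # (4, 1)
print(flat_offset(dense, (2, 1)))       # 9
print(flat_offset(transposed, (1, 2)))  # 9, the same storage element
```

Inside a kernel only the offset arithmetic survives: the pointer has no notion of shape, so any stride information the kernel needs has to be passed in explicitly as arguments.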

2. The Performance Cost of Eager Execution

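In plain eager execution, every operation (for example, an addition followed by a ReLU) requires its own kernel launch and its own round-trip through global memory. On modern GPUs this is one of the largest sources of lost performance, because the intermediate result is written out and read back even though it is needed only once. Triton overcomes this by fusing operations into a single kernel that processes a block of data (for example 128, 256, or 512 elements) directly in on-chip memory.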
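The cost of the extra round-trip can be estimated by counting global-memory accesses. The sketch below is a plain-Python model, not a benchmark: it assumes an elementwise add followed by a ReLU over n elements and counts one access per element read or written. The function names are hypothetical.

```python
def eager_traffic(n: int) -> int:
    """Global-memory element accesses for add and ReLU as two separate kernels."""
    add_kernel = 2 * n + n   # read a and b, write the intermediate sum
    relu_kernel = n + n      # read the intermediate back, write the output
    return add_kernel + relu_kernel


def fused_traffic(n: int) -> int:
    """Accesses when add and ReLU share one kernel; the sum never leaves the chip."""
    return 2 * n + n         # read a and b, write the output


def fused_add_relu(a, b):
    """Reference semantics of the fused operation: a single pass over the data."""
    return [max(x + y, 0.0) for x, y in zip(a, b)]


n = 1_000_000
print(eager_traffic(n), fused_traffic(n))                  # 5000000 3000000
print(fused_add_relu([1.0, -2.0, 3.0], [0.5, 1.0, -4.0]))  # [1.5, 0.0, 0.0]
```

Under these assumptions, fusion removes two of every five memory accesses and one of the two kernel launches. Real savings depend on the hardware and on how memory-bound the operation is.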

3. Thinking in Blocks

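Instead of the scalar, per-thread mindset of CUDA, Triton works at the block level in an SPMD (Single Program, Multiple Data) model. You write a single kernel, and Triton launches many instances of it concurrently across a grid. Each instance uses its program_id to compute the "slice" of memory that it owns.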
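The mapping from program instance to memory slice can be simulated in plain Python. This is a sketch of the idea rather than Triton code: `add_relu_program`, `launch`, and `owned_range` are hypothetical helpers, and the grid is walked sequentially where a GPU would run the instances in parallel. In an actual Triton kernel the same roles are played by `tl.program_id(axis=0)`, `tl.arange(0, BLOCK_SIZE)`, and masked `tl.load` / `tl.store` calls.

```python
import math


def add_relu_program(pid, a, b, out, n, BLOCK_SIZE):
    """One program instance: handles the block of indices owned by `pid`."""
    # Same role as pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) in a Triton kernel.
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    # The mask keeps the last, partially filled block from running past the end.
    mask = [off < n for off in offsets]
    for off, valid in zip(offsets, mask):
        if valid:
            out[off] = max(a[off] + b[off], 0.0)


def launch(a, b, BLOCK_SIZE):
    """Run every program instance in the grid (sequentially in this simulation)."""
    n = len(a)
    out = [0.0] * n
    grid = math.ceil(n / BLOCK_SIZE)  # enough programs to cover all n elements
    for pid in range(grid):
        add_relu_program(pid, a, b, out, n, BLOCK_SIZE)
    return out


def owned_range(pid, n, BLOCK_SIZE):
    """Half-open index range [start, end) that program `pid` is responsible for."""
    start = pid * BLOCK_SIZE
    return start, min(start + BLOCK_SIZE, n)


n, BLOCK_SIZE = 10, 4
grid = math.ceil(n / BLOCK_SIZE)
print([owned_range(pid, n, BLOCK_SIZE) for pid in range(grid)])
# [(0, 4), (4, 8), (8, 10)]

a = [float(i - 5) for i in range(n)]
b = [1.0] * n
print(launch(a, b, BLOCK_SIZE))
```

The last program owns a shorter range because n is not a multiple of BLOCK_SIZE; the mask is what makes that safe.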

4. Environment Setup

To get started, install Triton in a clean environment (created with Conda or venv) so that it does not run into dependency conflicts with an existing CUDA toolkit: pip install triton.
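After installing, a quick check of what the interpreter actually sees can catch a mixed-up environment early. The following is a minimal sketch using only the Python standard library; `installed_version` and `in_virtual_env` are hypothetical helper names. Note that the environment check only recognizes venv-style environments, since a Conda environment does not change `sys.base_prefix`.

```python
import importlib.metadata
import sys
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def in_virtual_env() -> bool:
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


for name in ("triton", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
print(f"venv active: {in_virtual_env()}")
```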


QUESTION 1

What is the primary difference between a PyTorch tensor and a Triton tensor within a kernel?

Triton tensors contain Python metadata like strides; PyTorch tensors are raw pointers.

A PyTorch tensor is a host-side object wrapping metadata; a Triton tensor represents blocks of data processed at the compiler level.

There is no difference; they are the same object.

Triton tensors are stored on the CPU, while PyTorch tensors are on the GPU.

QUESTION 2

Why is 'Eager Mode' considered a bottleneck for modern GPU performance?

Because it uses too much CPU memory.

Every operation requires a separate kernel launch and a global memory round-trip.

It cannot handle floating-point numbers.

It lacks support for the Python language.

QUESTION 3

What is the result of installing Triton in a 'dirty' environment with conflicting CUDA toolkits?

Triton will automatically fix the CUDA path.

It may lead to library version mismatches and kernel compilation errors.

The GPU will run faster due to multiple toolkit options.

Triton does not use CUDA, so there is no conflict.

QUESTION 4

Draw the mapping from pid to index range for N=1000, BLOCK_SIZE=256.

pid 0: [0, 256); pid 1: [256, 512); pid 2: [512, 768); pid 3: [768, 1000)

pid 0: [0, 1000)

pid 0: [0, 256); pid 1: [257, 512); pid 2: [513, 768); pid 3: [769, 1000)

pid 1: [0, 256); pid 2: [256, 512); pid 3: [512, 768); pid 4: [768, 1000)

QUESTION 5

In block-based parallelism, the instruction shift moves from 'compute one element' to:

'Compute one entire tensor'.

'Compute one block of 128/256/512 elements'.

'Compute one scalar at a time'.

'Let the CPU handle the math'.